Project 4: Exploratory Data Analysis

Explore and Summarise Data

Data Analyst Nanodegree (Udacity)

Project submission by Edward Minnett (ed@methodic.io).

February 15th 2017 (Revision 1)


Univariate Plots Section

Initial Summary Exploration

When exploring data for the first time, it helps to get a high-level view of the whole data set to identify where to zoom in and explore in more detail.

To begin with, what is the size and shape of the data set? This summary includes the head of the data frame.

White Wine

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Red Wine

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The data for both red and white wines are described by 12 variables. There are 4898 observations for the white wine, but only 1599 observations for the red wine.
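Summaries in this format are what base R's str() prints for a data frame. The following is a minimal sketch, assuming a tiny hand-built data frame (wine_sample is a hypothetical name) that reuses the first three white wine values shown above for three of the columns; the real analysis would load the full data files instead.

```r
# Hypothetical three-row stand-in for the white wine data frame.
# The values are the first three shown above for these three columns.
wine_sample <- data.frame(
  fixed.acidity = c(7.0, 6.3, 8.1),
  alcohol       = c(8.8, 9.5, 10.1),
  quality       = c(6L, 6L, 6L)
)

# One line per variable: name, storage type, and the first few values.
str(wine_sample)

# Number of observations (rows) and variables (columns).
shape <- dim(wine_sample)
print(shape)
```

Calling str() on the full white and red data frames in the same way gives the size, the column types, and a preview of the values in a single step.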

What are the summary statistics for each feature?

White Wine

##                      vars    n   mean    sd median trimmed   mad  min    max  range skew kurtosis   se
## fixed.acidity           1 4898   6.85  0.84   6.80    6.82  0.74 3.80  14.20  10.40 0.65     2.17 0.01
## volatile.acidity        2 4898   0.28  0.10   0.26    0.27  0.09 0.08   1.10   1.02 1.58     5.08 0.00
## citric.acid             3 4898   0.33  0.12   0.32    0.33  0.09 0.00   1.66   1.66 1.28     6.16 0.00
## residual.sugar          4 4898   6.39  5.07   5.20    5.80  5.34 0.60  65.80  65.20 1.08     3.46 0.07
## chlorides               5 4898   0.05  0.02   0.04    0.04  0.01 0.01   0.35   0.34 5.02    37.51 0.00
## free.sulfur.dioxide     6 4898  35.31 17.01  34.00   34.36 16.31 2.00 289.00 287.00 1.41    11.45 0.24
## total.sulfur.dioxide    7 4898 138.36 42.50 134.00  136.96 43.00 9.00 440.00 431.00 0.39     0.57 0.61
## density                 8 4898   0.99  0.00   0.99    0.99  0.00 0.99   1.04   0.05 0.98     9.78 0.00
## pH                      9 4898   3.19  0.15   3.18    3.18  0.15 2.72   3.82   1.10 0.46     0.53 0.00
## sulphates              10 4898   0.49  0.11   0.47    0.48  0.10 0.22   1.08   0.86 0.98     1.59 0.00
## alcohol                11 4898  10.51  1.23  10.40   10.43  1.48 8.00  14.20   6.20 0.49    -0.70 0.02
## quality                12 4898   5.88  0.89   6.00    5.85  1.48 3.00   9.00   6.00 0.16     0.21 0.01

Red Wine

##                      vars    n  mean    sd median trimmed   mad  min    max  range skew kurtosis   se
## fixed.acidity           1 1599  8.32  1.74   7.90    8.15  1.48 4.60  15.90  11.30 0.98     1.12 0.04
## volatile.acidity        2 1599  0.53  0.18   0.52    0.52  0.18 0.12   1.58   1.46 0.67     1.21 0.00
## citric.acid             3 1599  0.27  0.19   0.26    0.26  0.25 0.00   1.00   1.00 0.32    -0.79 0.00
## residual.sugar          4 1599  2.54  1.41   2.20    2.26  0.44 0.90  15.50  14.60 4.53    28.49 0.04
## chlorides               5 1599  0.09  0.05   0.08    0.08  0.01 0.01   0.61   0.60 5.67    41.53 0.00
## free.sulfur.dioxide     6 1599 15.87 10.46  14.00   14.58 10.38 1.00  72.00  71.00 1.25     2.01 0.26
## total.sulfur.dioxide    7 1599 46.47 32.90  38.00   41.84 26.69 6.00 289.00 283.00 1.51     3.79 0.82
## density                 8 1599  1.00  0.00   1.00    1.00  0.00 0.99   1.00   0.01 0.07     0.92 0.00
## pH                      9 1599  3.31  0.15   3.31    3.31  0.15 2.74   4.01   1.27 0.19     0.80 0.00
## sulphates              10 1599  0.66  0.17   0.62    0.64  0.12 0.33   2.00   1.67 2.42    11.66 0.00
## alcohol                11 1599 10.42  1.07  10.20   10.31  1.04 8.40  14.90   6.50 0.86     0.19 0.03
## quality                12 1599  5.64  0.81   6.00    5.59  1.48 3.00   8.00   5.00 0.22     0.29 0.02
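Several of the columns in these tables can be reproduced with a few lines of base R. The sketch below is an illustration only: describe_basic is a hypothetical helper, the skew and kurtosis are simple moment-based estimates that may not match the estimator used for the tables above, and the input is just the first ten white wine residual sugar values shown earlier.

```r
# Hypothetical helper computing a subset of the summary columns.
describe_basic <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  z <- (x - m) / s  # standardised values
  c(
    n        = n,
    mean     = m,
    sd       = s,
    median   = median(x),
    min      = min(x),
    max      = max(x),
    range    = max(x) - min(x),
    skew     = sum(z^3) / n,      # positive when the right tail is longer
    kurtosis = sum(z^4) / n - 3,  # excess kurtosis (0 for a normal shape)
    se       = s / sqrt(n)        # standard error of the mean
  )
}

# First ten residual sugar values for white wine, as listed above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
stats <- describe_basic(residual_sugar)
print(round(stats, 2))
```

Applying such a helper to every column (for example with sapply) would give one row of statistics per feature, similar in layout to the tables above.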

Before plotting the features, it is worth checking whether there are any obvious outliers in the data. For this analysis, I use Cook’s distance based on a linear model of the quality feature for each of the two types of wine, treating any point with a distance greater than 1 as exerting disproportionate influence on the model.

White Wine

Red Wine

This leaves us with a single outlier within the white wine data. From this point on, this datum will be excluded from the analysis.
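The outlier check described above can be sketched with base R's lm() and cooks.distance(). This is a minimal illustration under stated assumptions: the data are synthetic, the model uses a single predictor for brevity (the model used for the actual analysis is not shown), and one deliberately extreme row is appended so that the threshold of 1 has something to catch.

```r
set.seed(42)

# Synthetic stand-in data: 200 wines with a linear alcohol/quality relationship.
n <- 200
toy <- data.frame(alcohol = rnorm(n, mean = 10.5, sd = 1.2))
toy$quality <- 1 + 0.45 * toy$alcohol + rnorm(n, sd = 0.5)

# Append one deliberately extreme observation (row 201).
toy <- rbind(toy, data.frame(alcohol = 60, quality = -40))

# Fit a linear model for quality and compute Cook's distance per observation.
fit   <- lm(quality ~ alcohol, data = toy)
cooks <- cooks.distance(fit)

# Observations with a distance above 1 are treated as overly influential.
influential <- which(cooks > 1)
print(influential)

# Drop the flagged rows before further analysis.
toy_clean <- toy[cooks <= 1, ]
```

A threshold of 1 is a common rule of thumb for Cook’s distance; stricter cut-offs such as 4 / n flag many more points and would be a different design choice.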

Now let’s take a look at the general distribution for all the white wine features.

And the general distribution for all the red wine features.

These faceted histograms give us a good indication of the distribution for each of the 12 features for both datasets. Nearly all of the histograms are skewed to the right (positively skewed), with most values bunched on the left and outliers in the long right-hand tails. The notable exceptions are pH, which appears to be reasonably normally distributed, and density for red wine. The quality histograms immediately stand out because that feature only contains integer values: between 3 and 9 for white wines and between 3 and 8 for reds. The disparity between the number of observations for whites compared to reds becomes very clear. Some of the plots for the red wines are much less clearly defined because there are far fewer observations. The citric acid plot, in particular, shows a distinct lack of structure in the distribution for the red wines.

Let’s take a closer look at the citric acid for red wines and see if we can clarify the shape of the distribution.

It appears that citric acid for red wine is somewhat uniformly distributed with a few peaks at 0, 0.2, 0.24, and 0.49. The values begin to tail off above 0.5, with no values between 0.79 and a single outlier at 1.0.
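Narrowing the bins is what makes peaks at individual values visible. The sketch below shows the idea with base R's hist(), assuming a bin width of 0.01 (the width used for the original plot is not stated) and using only the first ten red wine citric acid values listed earlier rather than the full data.

```r
# First ten citric acid values for red wine, as listed in the summary above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Bin the values in steps of 0.01 without drawing anything (plot = FALSE).
bins <- hist(citric, breaks = seq(0, 1, by = 0.01), plot = FALSE)

# Report only the bins that contain at least one observation.
occupied <- bins$counts > 0
print(data.frame(bin_mid = bins$mids[occupied], count = bins$counts[occupied]))
```

Setting plot = TRUE (the default) draws the histogram instead of only returning the counts.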

Univariate Analysis

What is the structure of your dataset?

The two datasets contain a total of 6497 observations, with 4898 for white wines and 1599 for reds. Each observation is described by 12 variables (this description of the variables comes from the data description authored by Cortez et al.):

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)
  12. quality (score between 0 and 10). This is the output variable (based on sensory data).

All 12 variables are numeric. Quality is stored as an integer. Free sulfur dioxide and total sulfur dioxide are stored as floating point numbers but appear as whole numbers in the rows shown above. The other 9 variables represent floating point numbers.

What is/are the main feature(s) of interest in your dataset?

There isn’t a particular feature of interest in the data that stands out. What I am interested in finding out is whether there are any distinct differences in the physical characteristics of white and red wine. Just as importantly, I would like to know if there is a strong correlation between any of the physical characteristics of the wine and the perceived quality of that wine. If there is, do the characteristics involved differ for white and red wines?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Only further investigation will determine if this is true, but I have a feeling that extreme values in the physical characteristics of a wine will negatively impact the quality. This is merely a conjecture, but I think that if the acidity is too low or too high, or the sulfur dioxide is too low or too high, this is likely to lead to particularly low scoring wines. If this is true, I imagine that wines that tend to reside near the peaks for each of the physical characteristics will have above average quality scores.

Did you create any new variables from existing variables in the dataset?

No. I couldn’t think of a characteristic of the data that needed to be described by a new variable composed of the existing variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The data is already tidy, so further tidying wasn’t needed.

I performed outlier analysis using Cook’s distance based on a linear model for the quality feature for each of the two types of wine with a threshold of 1. This analysis found a single outlier in the white wine data, which has been removed from all subsequent analysis.

Of all 24 distributions, only the citric acid observations for red wine required further analysis. This is primarily because the distribution wasn’t clear when the data was plotted as a faceted set of histograms. Even when the bin width was decreased to get a better sense of the distribution’s shape, the distribution lacked a coherent unimodal or even multimodal shape.

Bivariate Plots Section

Correlations

White Wine Feature Correlations

The White Wine features don’t appear to be very highly correlated with each other. There are only three pairs of features with a Pearson product-moment correlation coefficient (r) greater than 0.5 and only one pair with a coefficient less than -0.5.

  • Density / Residual Sugar: 0.83
  • Total Sulfur Dioxide / Free Sulfur Dioxide: 0.62
  • Density / Total Sulfur Dioxide: 0.54
  • Density / Alcohol: -0.8

Red Wine Feature Correlations

The Red Wine features don’t appear to be very highly correlated with each other either. Like the White Wines, there are only three pairs of features with a Pearson product-moment correlation coefficient (r) greater than 0.5, but there are four pairs with a coefficient less than or equal to -0.5. Interestingly, only two of these pairs overlap with the White Wines.

  • Density / Fixed Acidity: 0.67
  • Citric Acid / Fixed Acidity: 0.67
  • Total Sulfur Dioxide / Free Sulfur Dioxide: 0.67
  • Density / Alcohol: -0.5
  • Citric Acid / pH: -0.54
  • Citric Acid / Volatile Acidity: -0.55
  • pH / Fixed Acidity: -0.68

The largest correlation scores with quality for each type of wine are with the alcohol content (r 0.44 for White Wine and r 0.48 for Red Wine). These aren’t particularly large scores, and they may say more about some reviewers’ preference for stronger drinks than about the wine itself.
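Lists like the ones above can be extracted from a correlation matrix programmatically. The following is a rough sketch using base R's cor(), which computes Pearson coefficients by default. The data frame is synthetic: the column names echo the wine features, but the values and the built-in relationships are assumptions made purely for illustration.

```r
set.seed(1)
n <- 500

# Synthetic features: density is constructed to rise with sugar and fall with alcohol.
sugar   <- rexp(n, rate = 1 / 6)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
toy <- data.frame(
  residual.sugar = sugar,
  alcohol        = alcohol,
  density        = 0.994 + 0.0004 * sugar - 0.0018 * (alcohol - 10.5) +
                   rnorm(n, sd = 0.0005),
  pH             = rnorm(n, mean = 3.2, sd = 0.15)
)

# Pearson correlation matrix for all pairs of features.
r <- cor(toy)

# Keep each pair once (upper triangle) where |r| is at least 0.5.
idx <- which(upper.tri(r) & abs(r) >= 0.5, arr.ind = TRUE)
strong <- data.frame(
  feature_a = rownames(r)[idx[, "row"]],
  feature_b = colnames(r)[idx[, "col"]],
  r         = round(r[idx], 2)
)

# Order from the strongest positive to the strongest negative correlation.
strong <- strong[order(-strong$r), ]
print(strong)
```

Restricting the search to the upper triangle avoids listing each pair twice, since a correlation matrix is symmetric.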

White Wines

What do the four strongest relationships in the white wines data look like (from the strongest positive correlation to the strongest negative)?

Density / Residual Sugar: r 0.83

Total Sulfur Dioxide / Free Sulfur Dioxide: r 0.62

Density / Total Sulfur Dioxide: r 0.54

Density / Alcohol: r -0.8

Red Wines

What do the seven strongest relationships in the red wines data look like (from the strongest positive correlation to the strongest negative)?

Density / Fixed Acidity: r 0.67

Citric Acid / Fixed Acidity: r 0.67

Total Sulfur Dioxide / Free Sulfur Dioxide: r 0.67

Density / Alcohol: r -0.5

Citric Acid / pH: r -0.54

Citric Acid / Volatile Acidity: r -0.55

pH / Fixed Acidity: r -0.68

Quality

We have seen that alcohol has the strongest relationship with quality for both types of wine. Let’s take a look at what these two distributions look like.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Unfortunately, there doesn’t appear to be a strong relationship between the quality of the wine and any of its physical characteristics. The strongest of these relationships is the one between alcohol and quality, though the correlation coefficient (r) is only 0.44 for white wine and 0.48 for red. This suggests that some reviewers may be biased toward perceiving a stronger wine as higher quality, though this can’t be confirmed with the data available (as we can’t determine which reviewer reviewed which wine). When tasting wine, the quantity of alcohol is one of the least subtle qualities to detect. It is possible that reviewers latched onto this quality to differentiate their preference for the different wines.

The fact that there isn’t a clear relationship between quality and any specific physical characteristic does tell us something about how wine is perceived. It is plausible that the physical character of the wine isn’t the primary predictor of quality. There are likely to be other factors not captured by this data. These could include the colour, context, shape of the glass, or whether the wine was perceived to be cheap or expensive.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It isn’t surprising that the strongest bivariate relationships in the data follow physical relationships in the chemistry and fermentation of wine. The strong relationship between density and residual sugar suggests that the density of the wine is highly influenced by the amount of residual sugar. Given that the fermentation process turns sugar into alcohol, it makes sense that there is an inverse relationship, nearly as strong, between density and alcohol. There is also a reasonably strong relationship between total sulfur dioxide and free sulfur dioxide (r 0.62 for white wine and 0.67 for red wine). For red wine, there is a strong positive relationship between citric acid and fixed acidity but a moderately strong inverse relationship between citric acid and volatile acidity. There are also reasonably strong negative relationships between pH and citric acid and between pH and fixed acidity. This makes sense as all of these chemical properties are associated with each other.

What is interesting is that the relationships between the acidic features are less prominent for white wine than they are for red wine.

What was the strongest relationship you found?

The strongest relationship I found was between density and residual sugar for white wine. These two features had the largest Pearson r score of 0.83. This was closely followed by a negative correlation between alcohol and density (also for white wine) with an r score of -0.8.

Multivariate Plots Section

We have established that the relationships between quality and the physical features of the wine aren’t very strong, but they do exist. Not only is there some correlation between quality and the physical features, but the nature of these relationships differs for red and white wines. We have also established that the strongest relationship is between alcohol and quality, so now let’s take the next three strongest relationships for each of the two types of wine, plot each against alcohol, and colour the points by quality.
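Ranking features by the strength of their relationship with quality can be sketched as follows. This is an illustration on synthetic data, not the wine data itself: the column names mirror three of the real features, and the effect sizes built into the toy quality score are assumptions chosen only so that the ranking has a clear order.

```r
set.seed(7)
n <- 1000

# Synthetic features with made-up effects on a rounded quality score.
toy <- data.frame(
  alcohol   = rnorm(n, mean = 10.5, sd = 1.2),
  density   = rnorm(n, mean = 0.994, sd = 0.003),
  chlorides = rnorm(n, mean = 0.05, sd = 0.02)  # no effect on quality here
)
toy$quality <- round(
  5.9 + 0.5 * (toy$alcohol - 10.5) - 100 * (toy$density - 0.994) +
    rnorm(n, sd = 0.7)
)

# Pearson correlation of every feature with quality.
features  <- setdiff(names(toy), "quality")
r_quality <- sapply(features, function(f) cor(toy[[f]], toy$quality))

# Rank by absolute value so strong negative relationships are not overlooked.
ranked <- r_quality[order(-abs(r_quality))]
print(round(ranked, 2))
```

Sorting on the absolute value matters here because a strongly negative coefficient, such as the one for density, is as informative as a strongly positive one.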

Let’s see what we can find.

White Wines

For white wines the four strongest relationships with quality are:

  • Alcohol (r 0.44)
  • Density (r -0.31)
  • Chlorides (r -0.21)
  • Total Sulfur Dioxide (r -0.17)

The three resulting plots are as follows:

Red Wines

For red wines the four strongest relationships with quality are:

  • Alcohol (r 0.48)
  • Volatile Acidity (r -0.39)
  • Sulphates (r 0.25)
  • Citric Acid (r 0.23)

The three resulting plots are as follows:

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

By looking at the relationships between alcohol (the feature with the strongest relationship with quality) and the features with the 2nd through 4th strongest relationships with quality for both white and red wines, we begin to get a sense of what influences the quality score for a given wine. The correlation matrices for each type of wine showed us that these relationships differ for white and red wines. These multivariate plots tell a subtle story. The strength of the relationship between alcohol and quality (common to all six plots) is quite clear. It is the relationship between quality and the other six features that is far more subtle.

There is a clear negative correlation between density and alcohol for white wine (r -0.8). As the alcohol content increases so does the quality which means that as the density decreases, the quality increases. This is to be expected given each feature’s correlation to each of the others.

White wines with high chloride levels tend to both be lower quality and have a lower percentage of alcohol.

White wines with high or low levels of total sulfur dioxide also tend to receive lower quality scores. Even so, of the relationships discussed so far, this is the least clear.

The relationships for red wine are a bit easier to read as there is less data being plotted. Red wines with a high volatile acidity tend to be lower quality and contain a medium amount of alcohol at most.

Red wines that receive the highest quality score tend to have a low to medium quantity of sulphates.

Of all six plots, the plot of the relationship between citric acid and alcohol for red wines is the most spread out. Citric acid is most evenly distributed for wines with lower levels of alcohol, whereas higher alcohol wines tend to have citric acid levels below 0.125 or between 0.3 and 0.7 g / dm^3.

Were there any interesting or surprising interactions between features?

When plotting chlorides and alcohol by quality for white wines, there is a distinct spike in the variability of a wine’s chloride levels when its alcohol is between 8.5 and 10. For these wines that have elevated chlorides, their quality tends to be low or medium (between 3 and 6 out of 9).

My idea that extreme values of sulfur dioxide would result in lower quality scores is validated (to some extent) when plotting total sulfur dioxide and alcohol by quality for white wines. For this plot, the highest quality wines tend to have alcohol levels above 10% and total sulfur dioxide levels between 75 and 200 mg / dm^3.

The plots for chlorides vs alcohol for white wine and sulphates vs alcohol for red wine have a very similar shape though the shape is less clear for red wines. This is probably because there is less data for red wines. The greatest variability of sulphates in red wine appears to occur in wines with less than average alcohol content. Though the wines with large amounts of sulphate appear to be lower quality, this relationship is less clear than that of chlorides and quality for white wine.


Final Plots and Summary

Plot One

One of the most interesting concepts pursued during this exploration is the idea that white wines and red wines differ at a much more fundamental level than their colour. To begin with, the reviewers who scored the wines tend to be more critical of red wines than white wines. This plot compares the kernel density estimates of the distribution of quality scores for the two types of wine. This is an effective way of comparing the two distributions because each density estimate integrates to 1, which removes the differences introduced by the disparity in size of the two data sets. Comparing two histograms of raw counts would be fruitless when there are roughly three times as many white wines as red wines in the data.

From this plot, we can clearly see that white wines are more likely to receive a score of 6 or higher than reds. White wines are also more likely to receive a very high score, whereas reds are far more likely to receive a score between 5 and 7, with many more receiving 5 than 7. It is hard to see in this plot as the number of wines receiving a score of 9 is very small, but all the wines receiving a quality score of 9 are white wines.
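The reason kernel density estimates suit samples of different sizes can be sketched with base R's density(). The quality scores below are synthetic: the sample sizes, the score probabilities, and the bandwidth of 0.5 are all assumptions for illustration and are not taken from the wine data.

```r
set.seed(3)

# Synthetic quality scores for two groups of very different sizes.
white_q <- sample(3:9, size = 3000, replace = TRUE,
                  prob = c(1, 3, 30, 45, 18, 3, 0.2))
red_q   <- sample(3:8, size = 1000, replace = TRUE,
                  prob = c(1, 3, 43, 40, 12, 1))

# Kernel density estimates on a shared grid with a fixed bandwidth.
d_white <- density(white_q, bw = 0.5, from = 2, to = 10)
d_red   <- density(red_q,   bw = 0.5, from = 2, to = 10)

# Trapezoidal area under a density curve; close to 1 regardless of sample size.
area <- function(d) sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
print(c(white = area(d_white), red = area(d_red)))
```

Because both curves enclose roughly the same area, they can be overlaid directly, for example with plot(d_white) followed by lines(d_red), without the larger sample dominating the picture.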

Plot Two

Though it may be tricky to read, this ‘joined’ correlation matrix is the most illuminating description of the two wine data sets, how they are similar, and more importantly, how they differ. I describe this plot as a ‘joined’ correlation matrix because it is two correlation matrices rendered as one. Typically, a correlation matrix is symmetrical along the diagonal, but instead of showing this redundant information, I chose to render one triangle as the correlation matrix for the white wines and the other triangle for the red wines. The lower left triangle is for the red wines and the upper right triangle is for the white wines. A correlation matrix normally has a value of ‘1’ along the diagonal as a feature is entirely correlated with itself. As the diagonal is ambiguous for the two data sets, I have chosen to eliminate this data and treat it as a separation between the two triangles.

The juxtaposition of correlation values clearly depicts the similarities and differences between the two types of wine. The large majority of relationships in the data are uncorrelated or very weakly correlated (either positively or negatively). Nearly every correlation has the same sign for the two types of wine, but the striking differences are in the strengths of the correlations. The two largest values are both in the white wines triangle, while the red wine data includes a larger number of correlations with a magnitude larger than or equal to 0.5.
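The construction of a ‘joined’ matrix like this one can be sketched in a few lines of base R. This is only an illustration of the idea on synthetic data (make_toy is a hypothetical helper and the three columns are stand-ins), not the code used to render the plot.

```r
set.seed(11)

# Hypothetical helper producing a small synthetic data set with three features.
make_toy <- function(n) {
  alcohol <- rnorm(n)
  data.frame(
    alcohol = alcohol,
    density = -0.8 * alcohol + rnorm(n, sd = 0.6),
    pH      = rnorm(n)
  )
}

r_white <- cor(make_toy(300))  # stand-in for the white wine correlations
r_red   <- cor(make_toy(100))  # stand-in for the red wine correlations

# Start from the white matrix, so the upper right triangle is already white.
joined <- r_white

# Overwrite the lower left triangle with the red wine correlations.
joined[lower.tri(joined)] <- r_red[lower.tri(r_red)]

# Blank the diagonal, which is ambiguous between the two data sets.
diag(joined) <- NA

print(round(joined, 2))
```

This relies on both matrices having the same features in the same order; if the column order differed, the two triangles would no longer line up pair for pair.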

Plot Three

With this third plot, I wanted to illustrate how white wines and red wines are subtly different in their physical characteristics and how this impacts their quality. The lack of clear relationships in the data made this task very difficult. The fact that the quality of white and red wines is influenced by different physical characteristics makes direct comparison quite challenging. I chose to compare two features that are among the most strongly related to quality for one type of wine but not the other. As we discovered earlier, the quality of both types of wine is most influenced by the quantity of alcohol in them. The next most influential feature for white wine is density (r -0.31). For red wine I chose citric acid (r 0.23), although volatile acidity (r -0.39) and sulphates (r 0.25) have stronger relationships with red wine quality.

I tried to layer alcohol and quality information into the plot by using those values to control the alpha and size of the points. Though this information isn’t clear for the points in the central cluster, it is possible to discern these details for points at the periphery, so I chose to keep the information as a part of the plot.

Apart from the general trend of white wine having a larger range of density than red wine and white wine having a larger cluster of data with similar quantities of citric acid, a very striking feature of this plot is the disproportionate amount of data with ‘round’ values for citric acid. There are very clear bands of wine with citric acid values of 0, 0.5 and 0.75 g / dm^3. I suspect there is a similar band at 0.25 g / dm^3, though it is hard to be sure as the band appears to be obscured by the cluster of data with this level of citric acid.


Reflection

This data appears to validate the notion that wine flavours can be very subtle and that it takes a discerning palate to differentiate between the traits of different wines. Though there are relationships between the different physical characteristics of a wine and its quality, these relationships are quite weak. Even alcohol, with the strongest relationship with quality, still has a Pearson product-moment correlation coefficient (r) of less than 0.5 for both white and red wines.

These subtle relationships made this exploration quite challenging. Without clear relationships to latch on to, the exploration proved quite nebulous and vague. The clearest finding from the exploration is that the difference between white and red wines isn’t merely their colour. The characteristics that help differentiate a low-quality wine from a high-quality one are different for red and white wines, though these relationships are weak and might have been overlooked if the data contained more striking correlations.

Rendering the joined correlation plot was a struggle and took quite a lot of tweaking. I’m not entirely happy with it. Ideally, I would like the classification of the two triangles (red and white) to be clear visually (rather than requiring a textual description) but was unable to render the distinction as I had wanted. For some reason, lines I was drawing over the top of the plot were getting clipped and stopped displaying halfway down the last row of the plot. I felt that my attempts looked messy, so I decided not to pursue it further.

Rendering the correlation matrices was the most illuminating part of the exploration process. Seeing the differences between the correlation values for white and red wines was the first place I saw evidence that the physical characteristics not only differ between red and white wines but also have differing impacts on the perceived quality of the two types of wine.

I would be interested in performing further analysis with wine data that included other data that would likely influence perceived quality such as perceived flavour (sweet, dry, etc.), colour, context, shape of the glass, or whether the wine was perceived to be cheap or expensive. It would also be interesting to know if individual reviewers reviewed multiple wines and whether there were trends in their reviews. For example, were individual reviewers more or less sensitive to certain physical features of the wine?

Resources